Systems and methods for obtaining an artificial intelligence model in a parallel configuration

ABSTRACT

A system may include multiple client devices and a processing device communicatively coupled to the client devices. A client device may receive an initial artificial intelligence (AI) model, use a training dataset to perform an AI task, and update its AI model. The client device may verify the performance of the AI task to determine whether to accept or reject its updated AI model. Upon rejection, the client device may repeat updating its AI model until the updated AI model is accepted, or until a stopping criterion is met. The processing device may be configured to update the initial AI models based on the accepted updated AI models obtained in the multiple client devices, and repeat the process for each client device using the updated initial AI models. Training data for each of the client devices may contain a subset shuffled from a larger training dataset.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the filing benefit of U.S. Provisional Application No. 62/793,835, filed Jan. 17, 2019, which is incorporated by reference herein in its entirety and for all purposes.

FIELD

This patent document relates generally to systems and methods for providing artificial intelligence solutions. Examples of determining an artificial intelligence model in a parallel configuration for loading into one or more artificial intelligence chips for performing artificial intelligence tasks are provided.

BACKGROUND

Artificial intelligence solutions are emerging with the advancement of computing platforms and integrated circuit solutions. For example, an artificial intelligence (AI) integrated circuit (IC) may include a processor capable of performing AI tasks in embedded hardware. Hardware-based solutions, as well as software solutions, still encounter the challenges of obtaining an optimal AI model, such as a convolutional neural network (CNN). A CNN may include multiple convolutional layers, and a convolutional layer may include multiple weights. Given the increasing size of the CNN that can be embedded in an IC, a CNN may include hundreds of layers and millions of weights. For example, the weights for an embedded CNN inside an AI chip may require as much as a few megabytes of data. This makes it difficult to obtain an optimal CNN model because a large amount of computing time is needed.

BRIEF DESCRIPTION OF THE DRAWINGS

The present solution will be described with reference to the following figures, in which like numerals represent like items throughout the figures.

FIG. 1 illustrates an example system in accordance with various examples described herein.

FIG. 2 illustrates a diagram of an example process of obtaining an optimal AI model in a parallel configuration in accordance with various examples described herein.

FIG. 3 illustrates a diagram of an example process of obtaining an optimal AI model that may be implemented in a host device in a parallel configuration in accordance with various examples described herein.

FIG. 4A illustrates a diagram of an example process of obtaining a local AI model that may be implemented in a client device in accordance with various examples described herein.

FIG. 4B illustrates a diagram of an example process of using an AI chip to perform an AI task in accordance with various examples described herein.

FIG. 5 illustrates a diagram of an example process of obtaining an optimal AI model in a parallel configuration in accordance with various examples described herein.

FIG. 6 illustrates a diagram of an example process of obtaining a local AI model that may be implemented in a client device in accordance with various examples described herein.

FIGS. 7A-7D illustrate various methods of obtaining training data in a parallel configuration in accordance with various examples described herein.

FIG. 8 illustrates various embodiments of one or more electronic devices for implementing the various methods and processes described herein.

DETAILED DESCRIPTION

As used in this document, the singular forms “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art. As used in this document, the term “comprising” means “including, but not limited to.”

Each of the terms “artificial intelligence logic circuit” and “AI logic circuit” refers to a logic circuit that is configured to execute certain AI functions, such as a neural network in AI or machine learning tasks. An AI logic circuit can be a processor. An AI logic circuit can also be a logic circuit that is controlled by an external processor and executes certain AI functions.

Each of the terms “integrated circuit,” “semiconductor chip,” “chip,” and “semiconductor device” refers to an integrated circuit (IC) that contains electronic circuits on semiconductor materials, such as silicon, for performing certain functions. For example, an integrated circuit can be a microprocessor, a memory, a programmable array logic (PAL) device, an application-specific integrated circuit (ASIC), or others. An integrated circuit that contains an AI logic circuit is referred to as an AI integrated circuit.

The term “AI chip” refers to a hardware- or software-based device that is capable of performing functions of an AI logic circuit. An AI chip can be a physical IC. For example, a physical AI chip may include an embedded cellular neural network (CeNN), which may contain weights and/or parameters of a CNN. The AI chip may also be a virtual chip, i.e., software-based. For example, a virtual AI chip may include one or more processor simulators to implement functions of a desired AI logic circuit.

The term “AI model” refers to data that include one or more parameters that, when loaded inside an AI chip, are used for executing the AI chip. For example, an AI model for a given CNN may include the weights, biases, and other parameters for one or more convolutional layers of the CNN. In this document, the weights and parameters of an AI model are used interchangeably.

FIG. 1 illustrates an example system in accordance with various examples described herein. In some examples, a communication system 100 includes a communication network 102. Communication network 102 may include any suitable communication links, such as wired (e.g., serial, parallel, optical, or Ethernet connections) or wireless (e.g., Wi-Fi, Bluetooth, or mesh network connections), or any suitable communication protocols now or later developed. In some scenarios, system 100 may include one or more host devices, e.g., 110, 112, 114, 116. A host device may communicate with another host device or other devices on the network 102. A host device may also communicate with one or more client devices via the communication network 102. For example, host device 110 may communicate with client devices 120a, 120b, 120c, 120d, etc. Host device 112 may communicate with client devices 130a, 130b, 130c, 130d, etc. Host device 114 may communicate with client devices 140a, 140b, 140c, etc. A host device, or any client device that communicates with the host device, may have access to one or more datasets used for obtaining an AI model. For example, host device 110 or a client device such as 120a, 120b, 120c, or 120d may have access to dataset 150.

In FIG. 1, a client device may include a processing device. A client device may also include one or more AI chips. In some examples, a client device may be an AI chip. The AI chip may be a physical AI IC. The AI chip may also be software-based, such as a virtual AI chip that includes one or more processor simulators to simulate the operations of a physical AI IC. A processing device may include an AI chip and contain programming instructions that will cause the AI chip to be executed in the processing device. Alternatively, and/or additionally, a processing device may also include a virtual AI chip, and the processing device may contain programming instructions configured to control the virtual AI chip so that the virtual AI chip may perform certain AI functions. In FIG. 1, each client device, e.g., 120a, 120b, 120c, 120d, may be in electrical communication with other client devices on the same host device, e.g., 110, or with client devices on other host devices.

In some examples, the communication system 100 may be a centralized system. System 100 may also be a distributed or decentralized system, such as a peer-to-peer (P2P) system. For example, a host device, e.g., 110, 112, 114, and 116, may be a node in a P2P system. In a non-limiting example, a client device, e.g., 120a, 120b, 120c, or 120d, may include a processor and a physical AI chip. In another non-limiting example, multiple AI chips may be installed in a host device. For example, host device 116 may have multiple AI chips installed on one or more PCI boards in the host device or in a USB cradle that may communicate with the host device. Host device 116 may have access to dataset 156 and may communicate with one or more AI chips via PCI board(s), internal data buses, or other communication protocols such as universal serial bus (USB).

In some scenarios, the AI chip may contain an AI model for performing certain AI tasks. Examples of an AI task may include image recognition, voice recognition, object recognition, data processing and analyzing, or any recognition, classification, or processing tasks that employ artificial intelligence technologies. In some examples, an AI model may include a forward propagation neural network, in which information may flow from the input layer to one or more hidden layers of the network to the output layer. For example, an AI model may include a convolutional neural network (CNN) that is trained to perform voice or image recognition tasks. A CNN may include multiple convolutional layers, each of which may include multiple parameters, such as weights and/or other parameters. In such a case, an AI model may include the parameters of the CNN model. In some examples, a CNN model may include weights, such as a mask and a scalar for a given layer of the CNN model. For example, a kernel in a CNN layer may be represented by a mask that has multiple values in lower precision multiplied by a scalar in higher precision. In some examples, a CNN model may include other parameters. For example, an output channel of a CNN layer may include one or more bias values that, when added to the output of the output channel, adjust the output values to a desired range.

In a non-limiting example, a computation in a given layer of a CNN model may be expressed by Y=w*X+b, where X is input data, Y is output data, w is a kernel, and b is a bias; all variables are relative to the given layer. Both the input data and the output data may have a number of channels. Operation “*” is a convolution. Kernel w may include binary values. For example, a kernel may include 9 cells in a 3×3 mask, where each cell may have a binary value, such as “1” and “−1.” In such a case, a kernel may be expressed by multiple binary values in the 3×3 mask multiplied by a scalar. In other examples, for some or all kernels, each cell may be a signed 2-, 3-, 5-, or 8-bit integer. Other bit lengths or values may also be possible. The scalar may include a value having a bit width, such as 12-bit or 16-bit. Other bit lengths may also be possible. Alternatively, and/or additionally, a kernel may contain data with non-binary values, such as seven-valued data. The bias b may contain a value having multiple bits, such as 18 bits. Other bit lengths or values may also be possible. In a non-limiting example, the output Y may be further discretized into a signed 6-bit or 11-bit integer. Other bit lengths or values may also be possible.
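As a non-limiting illustration of the kernel representation described above, the following Python sketch applies a 3×3 binary mask scaled by a higher-precision scalar, plus a bias, to a single input channel. The function name, array shapes, and numeric values are illustrative assumptions and are not part of the disclosed AI chip.

```python
import numpy as np

def quantized_conv2d_single(x, mask, scalar, bias):
    """Apply a 3x3 binary mask scaled by a scalar, plus a bias, to one channel.

    x:      2-D input array (a single input channel)
    mask:   3x3 array with values in {+1, -1} (the low-precision part of the kernel)
    scalar: higher-precision multiplier shared by the whole mask
    bias:   value added to every output element
    """
    kernel = scalar * mask                       # effective kernel w = scalar * mask
    h, w = x.shape
    out = np.zeros((h - 2, w - 2))               # 'valid' sliding window, no padding
    for i in range(h - 2):
        for j in range(w - 2):
            # sliding-window sum (cross-correlation, as commonly implemented)
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * kernel) + bias
    return out

# Illustrative call with a random input and a random binary mask.
x = np.random.rand(8, 8)
mask = np.where(np.random.rand(3, 3) < 0.5, -1.0, 1.0)   # cells in {+1, -1}
y = quantized_conv2d_single(x, mask, scalar=0.125, bias=0.5)
print(y.shape)                                   # (6, 6)
```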

In the case of a physical AI chip, the AI chip may include an embedded cellular neural network that has memory containing the multiple parameters in the CNN. In some scenarios, the memory in a physical AI chip may be a one-time-programmable (OTP) memory that allows a user to load a CNN model into the physical AI chip once. Alternatively, a physical AI chip may have a random access memory (RAM), magnetoresistive random access memory (MRAM), or another type of memory that allows a user to update and load a CNN model into the physical AI chip multiple times.

In the case of a virtual AI chip, the AI chip may include a data structure that simulates the cellular neural network in a physical AI chip. In other examples, a virtual AI chip may directly execute an AI logic circuit without needing to simulate a physical AI chip. A virtual AI chip can be particularly advantageous when higher precision is needed, or when there is a need to compute layers that cannot be accommodated by a physical AI chip.

In the case of a hybrid AI chip, part of an AI logic circuit can be computed using a physical AI chip, while the remainder can be computed with a virtual chip. In a non-limiting example, the physical AI chip may implement all convolutional, MaxPool, and some of the ReLU layers, while the virtual AI chip implements other layers. This is useful because physical AI chips can greatly accelerate the computations of some convolutional layers, without needing to accommodate every possible layer.

In some examples, a host device may compute one or more layers of a CNN before sending the output to a physical AI chip. In some examples, the host device may use the output of a physical AI chip to compute the output of an AI task. For example, a host device may receive the output of the convolutional layers of a CNN from a physical AI chip and perform the operations of the fully connected layers.
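The following is a minimal Python sketch, under assumed interfaces, of the split described above: a stand-in for the physical AI chip returns convolutional features, and the host computes a fully connected layer on those features. The function names, feature size, and class count are hypothetical.

```python
import numpy as np

def run_physical_ai_chip(image):
    """Stand-in for a physical AI chip that returns convolutional features.

    In a real system this would send `image` to the chip (e.g., over PCI or USB)
    and read back the feature map; here a feature vector is simply fabricated.
    """
    rng = np.random.default_rng(0)
    return rng.random(256)                        # hypothetical 256-d feature vector

def host_fully_connected(features, weights, biases):
    """Fully connected layer computed on the host from the chip's output."""
    return features @ weights + biases

features = run_physical_ai_chip(image=np.zeros((224, 224, 3)))
weights = np.random.randn(256, 10) * 0.01         # hypothetical 10-class head
biases = np.zeros(10)
scores = host_fully_connected(features, weights, biases)
print(int(np.argmax(scores)))                      # predicted class index
```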

With further reference to FIG. 1, a host device on a communication network as shown in FIG. 1 (e.g., 110) may include a processing device and contain programming instructions that, when executed, will cause the processing device to access a dataset, e.g., 150, for example, training data. The training data may be provided for use in obtaining the AI model. In doing so, the AI model may be trained depending on the application. For example, training data may be used for training an AI model that is suitable for face recognition tasks, and may contain any suitable dataset collected for performing face recognition tasks. In another example, training data may be used for training an AI model suitable for scene recognition in video and images, and may contain any suitable scene dataset collected for performing scene recognition tasks. In some scenarios, training data may reside in a memory in a host device. In one or more other scenarios, training data may reside in a central data repository and be available for access by a host device (e.g., 110, 112, 114 in FIG. 1) or a client device (e.g., 120a-d, 130a-d, 140a-d in FIG. 1) via the communication network 102. In some examples, system 100 may include multiple datasets, such as datasets 150, 152, 154. A CNN model may be obtained by using the multiple devices in a communication system such as shown in FIG. 1. Details are further described with reference to FIGS. 2-7.

FIG. 2 illustrates a diagram of an example process of obtaining an optimal AI model in a parallel configuration in accordance with various examples described herein. A process 200 for training an AI model may be implemented in a processing device, such as a host device. The process 200 may perform various functions in one or more devices, such as Device 1, Device 2, . . . , Device N. In some examples, the process 200 may include providing training configuration parameters at 202, providing training data at 204, and/or providing initial AI models at 206. For example, process 204 may retrieve a training dataset and provide a subset of data shuffled from the training dataset. These processes 202, 204, 206 prepare the data and AI models that may be used by a process for each of the multiple devices Device 1, Device 2, . . . , or Device N. The process 200 may include multiple iterations, which may stop when an iteration stopping criterion is met at 222. During each iteration, a new training process may be started for each of the multiple devices. Each of the multiple devices may include an AI chip for running an AI task with an AI model inside the AI chip. Each of the multiple devices may upload an AI model to the AI chip in the device. A device may also be capable of updating the AI model. In a training process, the training data for each of the multiple devices may contain a subset of data shuffled from a larger training dataset.

In some examples, a process that may run in one of the multiple devices, such as Device 1, may include obtaining training data at 208, running the AI chip in the device at 210, and determining a performance value of the run at 212, where the performance value may be indicative of the performance of the AI model used in the run. Running an AI chip may include executing a physical AI chip. For example, the AI chip may include a CeNN, in which case running the AI chip may include performing an AI task (e.g., a recognition task) using the parameters (including weights) of the CeNN. Similarly, running an AI chip may include executing a virtual chip. For example, the virtual AI chip may include a CNN, in which case running the AI chip may include performing one or more convolutions using the weights and parameters of the CNN. The process for each device may further include updating the AI model at 214 and determining whether to accept the updated AI model at 216. The process 200 may repeat updating the AI model at 214 until the updated AI model is accepted at 216. Upon acceptance of the updated AI model, the process for each of the multiple devices may output the respective updated AI model of the device. The process 200 may further update the training configuration parameters at 218, determine an optimal AI model from among the multiple devices at 220, and repeat the training process for the multiple devices until the iteration stopping criterion is met at 222. Examples of boxes 208-216 are further described in detail in FIG. 4A.

When the stopping criterion is met, the process 200 may validate the optimal AI models at 224 and obtain an optimal AI model at 226. In each iteration before the stopping criterion is met at 222, processes 208-216 may be implemented in any of the devices, such as Device 1, Device 2, etc. As shown in FIG. 2, multiple devices (e.g., AI chips) may run multiple training processes in parallel, each of which produces a respective updated AI model, where the training data for each device may be shuffled in variable ways, to be explained further. The training processes for the multiple devices may be performed in parallel and moderated based on the behaviors of the other devices. For example, the training process for each device may be based on a different training dataset (e.g., non-overlapping or overlapping), depending on how the training datasets are shuffled. The training process for the multiple devices may also be based on the updated training configuration parameters, which may determine how the acceptance is determined (e.g., at 216). The acceptance criteria may also be determined, at least in part, based on the performance values of the current and updated AI models. In some examples, the acceptance criteria may be determined, at least in part, based on the average parameters (e.g., weights) of AI models in one or more of the multiple devices.

Various boxes in FIG. 2 may be implemented in either a host device or a client device, or a combination thereof. Without limiting the scope of the disclosure, FIG. 3 illustrates a diagram of an example process of obtaining an optimal AI model that may be implemented in a host device in a parallel configuration in accordance with various examples described herein. In some examples, a host device (such as 110 in FIG. 1) may be configured to implement one or more training processes for one or more client devices (e.g., one or more AI chips) with which the host device is communicating (e.g., 120a, 120b, 120c, 120d under host device 110, or one or more AI chips under host device 116) to cause each of the multiple client devices to determine a respective AI model. In a parallel configuration, such as shown in FIG. 2, the multiple devices (e.g., Device 1, Device 2, . . . , Device N) may be configured to each determine an AI model in parallel. While a training process in the parallel configuration may include one or more iterations, at each iteration, the AI models updated in the multiple devices may be communicated to the host device. The host device may receive the AI models and associated performance values from the multiple devices and assess the performance values among the multiple devices. The host device may determine an optimal AI model based on the performance values of the multiple devices. The host device may also update the training configuration parameters for the next iteration. The host device may transmit the updated training configuration parameters and the optimal AI model at the current iteration back to each of the multiple devices to be used in the next iteration. In the next iteration, the host device may continue receiving the updated AI models from the multiple devices, where the updated AI models are generated in the multiple devices based on the updated training configuration parameters and the optimal AI model obtained from the previous iteration. The host device may repeat the iterations until a stopping criterion is met.

In FIG. 3, in some examples, a process 300 may be implemented in a host device (e.g., 110, 112, 114 in FIG. 1). The process 300 may implement one or more functions in FIG. 2 in a host device, whereas one or more other functions in FIG. 2 can be implemented, as in FIG. 4A, in a client device. For example, the process 300 may provide training data at 302. The process 300 may also include providing training configuration parameters at 304. The process 300 may also include providing initial AI models at 306 to the client devices. In some examples, the initial AI models may include multiple initial AI models, each for a respective client device or an AI chip (under the host device). The initial AI models may be identical or different among different client devices (e.g., AI chips). Once a client device or an AI chip receives a respective initial AI model, that client device or AI chip may execute an AI task using the initial AI model to generate a respective updated AI model; this process is further described in FIG. 4A. In some examples, an AI model may include multiple parameters (e.g., weights and other parameters of a CNN model) for use in running an AI chip in the client device.

In some examples, the training data may include one or more training datasets. Each dataset may include training data for obtaining an AI model for use in performing an AI task. For example, a first training dataset may include training data for training an AI model for use in recognizing a cat's face, and a second training dataset may include training data for training an AI model for use in recognizing a dog's face. In some examples, a training dataset may include one or more subsets of training data. For example, in a training dataset for recognizing a cat's face, a first subset may include training data collected over a first period of time, e.g., during a first month. A second subset may include training data collected over a second period of time, e.g., during a second month. In some examples, a subset of training data may include training data arranged in other suitable ways, such as data collected by time, by the breed of cats being pictured, or by the imaging devices (e.g., a camera or a mobile phone) being used in collecting the data. Other suitable divisions of training data may also be possible. In some examples, the training data may include pictures that include one or more cat faces, or no cat faces, and the ground truth data may include the classifications associated with the pictures, such as the class (e.g., the breed of a cat) to which each picture or a cat face in a picture belongs.

In some examples, the training configuration parameters may include one or more data values that may be used to adjust a training process. In a non-limiting example, the training configuration parameters may include data values such as β and γ, which may be used by each client device in obtaining a local optimal AI model. This process will be described in further detail in FIG. 4A.

In providing the various data, such as the training data, the training configuration parameters, or the initial AI models, to one or more client devices and/or AI chips, in some examples, the host device may transmit the data to the multiple devices via a communication protocol, e.g., TCP/IP, Wi-Fi, Bluetooth, serial or parallel communications, or other communication protocols, whether wired or wireless. In some examples, the training data may be provided to the multiple devices via a database, such as a data repository, which is accessible by one or more of the multiple devices, where a device may retrieve a portion of the training data from the database.

With further reference to FIG. 3, process 300 may include receiving updated AI models at 308 from the one or more client devices (e.g., AI chips). In some examples, a client device may return an updated AI model to the host device. The host device may subsequently receive multiple AI models, each from a client device. Process 300 may subsequently determine an optimal AI model at 310 based on the updated AI models of one or more client devices and a performance value associated with each AI model. The process 300 may also update the training configuration parameters at 312. The process 300 may repeat 308, 310, and/or 312 for a number of iterations until the iteration count has exceeded a threshold T_(C) at 316 and/or the time duration of the process has exceeded a threshold T_(D) at 318. At each iteration, the iteration count increments at 314. Other stopping criteria may also be possible. At each iteration, process 300 continues receiving updated AI models from the client devices at 308 and determining the optimal AI model at 310.
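A hedged Python sketch of this host-side loop of process 300 is shown below. The client interface (a hypothetical train method returning an updated model and its performance value), the DummyClient placeholder, and the multiplicative updates to β and γ are illustrative assumptions, not the disclosed implementation.

```python
import random
import time

def host_training_loop(clients, initial_model, beta, gamma,
                       beta_step, gamma_step, T_C, T_D):
    """Sketch of process 300: iterate until the count exceeds T_C or the time exceeds T_D.

    Each client is assumed to expose a hypothetical
    train(model, beta, gamma) -> (updated_model, performance) method.
    """
    start = time.time()
    optimal_model, optimal_perf = initial_model, float("-inf")
    iteration = 0
    while iteration <= T_C and (time.time() - start) <= T_D:
        results = [c.train(optimal_model, beta, gamma) for c in clients]   # box 308
        optimal_model, optimal_perf = max(results, key=lambda r: r[1])     # box 310
        beta *= beta_step                                                  # box 312
        gamma *= gamma_step
        iteration += 1                                                     # box 314
    return optimal_model, optimal_perf

class DummyClient:
    """Placeholder client that perturbs a scalar 'model' and reports a toy accuracy."""
    def train(self, model, beta, gamma):
        updated = model + random.uniform(-0.1, 0.1)
        return updated, 1.0 - abs(updated)         # best performance when model is near 0

best_model, best_perf = host_training_loop(
    [DummyClient() for _ in range(4)], initial_model=0.5,
    beta=1.0, gamma=0.1, beta_step=1.0003233, gamma_step=1.0008817,
    T_C=100, T_D=5.0)
print(best_model, best_perf)
```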

Let M″_(i,0), M″_(i,1), . . . , M″_(i,N−1) represent the updated AI model from each client device 0, 1, 2, . . . , N−1, respectively, at the ith iteration, where N represents the number of client devices. A model M may include one or more parameters of the CNN model, including weights and other parameters, such as the bias values. Model M may have any suitable data structure. For example, model M may include a flat one-dimensional (1D) data structure that holds the CNN parameters and weights sequentially, ranging from a few bytes to a few megabytes or more. The parameters (including weights) of a CNN model may depend on the AI task for which the AI model is to be obtained, and on the dataset for performing the AI task using the AI chip. For example, AI tasks having different complexity levels may require different sets of CNN weights.

Let A″_(i,0), A″_(i,1), . . . , A″_(i,N−1) stand for the performance values of the updated AI models from each client device at the ith iteration. In some examples, a performance value A may include a single value measured as the recognition accuracy associated with an AI model M, such as the updated AI model from a client device. For example, A″_(i,0) may stand for the performance of model M″_(i,0) and have a value of 0.5. If H_(i) stands for the optimal AI model at the ith iteration, then H_(i) may be determined based on the received updated AI models and associated performance values from one or more client devices. In a non-limiting example, a host device may determine the optimal AI model for that host device by selecting a received updated AI model that has the best performance value among all client devices. For example, if the performance value represents the accuracy of recognition using an AI model, then selecting the best performance includes selecting the AI model that has the highest performance value among all client devices.

Although it is illustrated that, at each iteration, the optimal AI model may be determined based on the received AI models and associated performance values from one or more client devices, other variations may be possible. For example, the optimal AI model may be determined based on criteria other than the best performance value. In some examples, the optimal AI model may be determined based on the performance values of a subset of the client devices. For example, the process may select from among the top five of a total of ten client devices, or remove the bottom two client devices, in terms of the performance value of the AI model associated with each client device.

Returning to FIG. 3, in updating the training configuration parameters at 312, the process may adjust the training configuration parameters via an annealing process. For example, the configuration parameters may include data values β and γ, which may be increased exponentially. In some examples, each of the values β and γ may increase over a range during the entire training process. To achieve the full (maximum) value for β and γ by the end of the training process, each iteration in the process 300 may increase the values by a small incremental amount. In a non-limiting example, β may be increased from an initial value of 1 to a value of 3. In an example, γ may be increased from an initial value of 0.1 to a value of 2. If the maximum number of iterations (e.g., T_(C)) is, for example, 3400, then, at each iteration, β may be multiplied by approximately 1.0003233, and γ may be multiplied by approximately 1.0008817.
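For illustration, the per-iteration multipliers quoted above can be recovered from the start value, end value, and iteration budget; the short Python sketch below (the function name is an assumption) reproduces factors close to the approximately 1.0003233 and 1.0008817 quoted above.

```python
def per_iteration_multiplier(start, end, iterations):
    """Multiplier m such that start * m**iterations == end (geometric annealing)."""
    return (end / start) ** (1.0 / iterations)

print(per_iteration_multiplier(1.0, 3.0, 3400))   # ~1.000323  (beta: 1 -> 3)
print(per_iteration_multiplier(0.1, 2.0, 3400))   # ~1.000881  (gamma: 0.1 -> 2)
```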

At each iteration, before the stopping criteria are met at 316, 318, processes 308, 310 and 312 may repeat. At each iteration, process 300 may update the initial AI models for the client device(s) with the optimal AI model determined at 310 in a previous iteration; thus, the training process in each client device may “restart.” In other words, process 300 may determine the optimal AI model at 310, update the training configuration parameters at 312, and cause the training process at a client device to “restart.” For example, before repeating receiving updated AI models at 308, the process 300 may transmit the updated optimal AI model and the updated training configuration parameters obtained from, e.g., 310 and 312 in a preceding iteration to the client devices and wait for the updated AI models from the client devices. A client device may receive the optimal AI model and the updated training configuration parameters determined by the host device (e.g., at 310, 312), where each client device may use the optimal AI model determined from 310 as an initial AI model and perform a training process based on the updated initial AI model. The details are further disclosed in FIG. 4A.

As another non-limiting example, at each iteration, process 308 may instead update the initial AI models for the client device(s) with a respective previously output AI model for that client device. If each of the client device(s) has a record of the AI model it last output, process 308 may instead, equivalently, omit updating the initial AI model, since the client devices have already updated themselves. The optimal AI model determined from 310 can be stored for future use. For example, it can be used as another AI model to choose from in the next iteration of process 310. In other words, process 300 may determine the optimal AI model at 310 and store it. The process 300 may update the training configuration parameters at 312, and cause the training process at a client device to receive its previously output AI model together with the updated training configuration parameters determined by the host device (e.g., at 310, 312), where each client device may perform a training process based on the updated initial AI model. The details are further disclosed in FIG. 4A.

With further reference to FIG. 3, once the stopping criteria are met, the process 300 may end the iterations and further validate the optimal AI model from the multiple client devices at 320. In some examples, in validating the optimal AI model (e.g., the optimal AI model determined from 310), the process 320 may further evaluate the received updated AI models (from 308) along with the optimal AI model determined from 310. For example, if there are 10 client devices (e.g., N=10 in FIG. 2), the number of AI models to be evaluated is 11. The process 320 may determine a selected number of optimal AI models from the AI models being evaluated, based on the performance value associated with each AI model. In some examples, the process 320 may select the top five AI models. In some examples, in evaluating the AI models, the process 320 may use a validation dataset. The validation dataset may be independent from the training dataset. The validation dataset may also include a portion of the training dataset.

Additionally, the process 320 may further evaluate the selected number of optimal AI models using the entire training dataset, and determine a final optimal AI model that has the best performance value. In some examples, the performance value associated with an AI model may be the accuracy of the AI model. In some examples, the performance value may include other criteria, such as the computation time for the AI model to be run in an AI chip, or the accuracy of the AI model, or a combination thereof. In the example above, the process 320 may further evaluate the selected top five optimal AI models and determine an optimal AI model that has the best performance value among the top optimal AI models. Upon determining the AI model with the best performance value, process 320 will have validated the optimal AI model and will output the optimal AI model at 322. Here, the optimal AI model after the validation at 320 may be the same optimal AI model from 310 or may be different from the optimal AI model prior to validation.
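A minimal Python sketch of this validation step (process 320) is given below, assuming hypothetical evaluation callables that run the AI chip on the validation dataset and on the entire training dataset and return an accuracy-like score; the scalar "models" and toy scores are illustrative only.

```python
def validate_and_select(models, evaluate_on_validation, evaluate_on_full_training, top_k=5):
    """Sketch of process 320: shortlist by validation score, then pick the final model.

    `models` includes the updated model from each client plus the current optimal
    model (e.g., 11 models for 10 clients). The two evaluate_* callables are
    hypothetical hooks that run the AI chip and return an accuracy-like score.
    """
    ranked = sorted(models, key=evaluate_on_validation, reverse=True)
    shortlist = ranked[:top_k]                                # e.g., the top five models
    return max(shortlist, key=evaluate_on_full_training)      # final optimal AI model

# Illustrative call with scalar 'models' and toy scoring functions.
models = [0.2, 0.9, 0.4, 0.7, 0.95, 0.1]
best = validate_and_select(models,
                           evaluate_on_validation=lambda m: m,
                           evaluate_on_full_training=lambda m: 1.0 - abs(m - 0.9))
print(best)   # 0.9 under these toy scores
```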

Once the final optimal AI model is determined, the process 300 may upload the optimal AI model at 324 into one or more client devices (e.g., AI chips) for performing future AI tasks. In some examples, the optimal AI model may be shared among multiple processing devices on the network, in which case any device may load the optimal AI model into an embedded CeNN of an AI chip and execute the CeNN to perform an AI task based on the loaded optimal AI model.

Now FIG. 4A illustrates a diagram of an example process of obtaining a local AI model in a training process that may be implemented in a client device in accordance with various examples described herein. In some examples, a process 400 may be implemented in a client device, a host device, and/or an AI chip, such as shown in FIG. 1. The process 400 may train an AI model via one or more iterations. In some examples, the process 400 may implement one or more functions in FIG. 2 in a client device, whereas one or more other functions in FIG. 2 may be implemented, as in FIG. 3, in a host device. For example, the process 400 may implement boxes 208-216 in Device 1 (in FIG. 2), or in other devices such as Device 2, . . . , Device N. At the beginning of the training process, the process 400 may include obtaining training data at 402, receiving training configuration parameters at 404, and/or receiving an (initial) AI model at 406. For example, the training dataset may reside at any of the devices (host or client devices) on the communication network (e.g., 102 in FIG. 1) and may be accessible to any other devices. In some examples, 402, 404, 406 may occur at the start of each iteration in the process 300 (e.g., 308 in FIG. 3). The process 400 may run an AI chip to infer the performance of the AI model at 408. For example, in running the AI chip, the process 408 may load the AI model into the AI chip and execute the AI chip to perform an AI task, using the training data from 402. The process 400 may further determine the performance value of the AI model at 410 by evaluating the result generated from the AI chip based on the AI model.

With further reference to FIG. 4A, the process 400 may start the iteration at 412. For example, at each iteration, the process may include updating the AI model at 412 based on the current AI model. At the start of the iteration, the current AI model may be the initial AI model received at 406 (from a host device, for example). During subsequent iterations, the current AI model may be the last updated AI model obtained from 412.

In some examples, the process 400 may update the AI model at 412 by various methods. For example, the process 412 may generate an updated AI model by applying a perturbation to the initial AI model. For example, at the mth iteration in process 400, an updated AI model for client device i may be represented as M_(i,m)=M_(i,m−1)+ΔM, where ΔM is the perturbation. In some examples, process 400 may include a different process in which a small change to the parameters of the AI model is made. In some examples, an AI model may include a 1D column vector, which contains all of the weights and/or parameters of the AI model arranged sequentially in 1D. When an AI model is represented by a 1D column vector, a subtraction of two AI models may include a 1D column vector containing multiple parameters, each of which is a subtraction of two corresponding parameters in the 1D column vectors that represent the two AI models, respectively. An addition of two AI models may include multiple parameters, each of which is a sum of two corresponding parameters in the two AI models. An average of multiple AI models may include parameters, each of which is an average of the corresponding parameters in the multiple AI models. Similarly, an AI model may be incremented (added or subtracted) by a perturbation. The resulting model may contain multiple parameters, each of which includes a corresponding parameter in the AI model incremented (added or subtracted) by a corresponding parameter in the perturbation. In some examples, an addition of two AI models may be in a discrete or finite field. For example, the addition of scalars and biases in two (or multiple) CNN models may be done in a real coordinate space, subject to capping at their respective minimum and maximum values. In another example, the addition of masks in multiple CNN models may be done in a finite field, in which each cell in the resulting mask may take a value from said finite field.
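The following Python sketch illustrates, under the flat 1D representation described above, how perturbation, averaging, and range capping reduce to element-wise vector operations. The tiny example vectors, function names, and clipping range are assumptions for illustration.

```python
import numpy as np

def perturb(model, delta):
    """Element-wise increment of a model stored as a flat 1-D vector (M_m = M_{m-1} + dM)."""
    return model + delta

def average(models):
    """Element-wise average of several models with identical layout."""
    return np.mean(np.stack(models), axis=0)

def clip_real_valued(model, lo, hi):
    """Cap real-valued parameters (e.g., scalars and biases) at their allowed range."""
    return np.clip(model, lo, hi)

m_prev = np.array([0.5, -1.0, 1.0, 0.25])        # hypothetical tiny model
delta = np.array([0.0, 0.0, -2.0, 0.001])        # perturbation dM
m_new = clip_real_valued(perturb(m_prev, delta), lo=-1.0, hi=1.0)
print(m_new, average([m_prev, m_new]))
```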

Returning to block 412 in FIG. 4A, updating the AI model may include updating one or more parameters of the AI model with a probability to change and an amplitude of change for each group of parameters, such as the scalars, masks, and biases in a CNN model. For example, the probabilities to change the scalar, the mask, and the bias may be 0.01, 0.001, and 0.01, respectively. The amplitude of change for the scalar and the bias may be 0.001. In an example implementation, the process may generate a random number, e.g., in the range of 0 to 1.0, and compare the random number to the probabilities for the group of parameters. If the random number is below the probability for a given group of parameters, that group of parameters may change according to the amplitude of change. In the case of the previous example, a random number may be generated. If the random number is less than 0.01, the process may subsequently change the scalar by 0.001. In changing the values in a mask, the process may change each value in the mask to its neighboring value. For example, if a value in a mask is a binary value having two values {+1, −1}, each change of value may be a switch between the two values (−1 or +1).
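A hedged Python sketch of this group-wise perturbation is shown below, using the example probabilities and amplitude quoted above. The direction of the ±0.001 change and the flat list layout of the mask are assumptions not specified in the text.

```python
import random

def propose_update(scalars, masks, biases,
                   p_scalar=0.01, p_mask=0.001, p_bias=0.01, amplitude=0.001):
    """Sketch of box 412: perturb each parameter group with its own probability."""
    new_scalars = [s + random.choice([-amplitude, amplitude])
                   if random.random() < p_scalar else s for s in scalars]
    new_masks = [[-cell if random.random() < p_mask else cell for cell in mask]
                 for mask in masks]                      # flip binary cells {+1, -1}
    new_biases = [b + random.choice([-amplitude, amplitude])
                  if random.random() < p_bias else b for b in biases]
    return new_scalars, new_masks, new_biases

scalars, masks, biases = [0.5, 0.25], [[1, -1, 1, -1, 1, 1, -1, 1, -1]], [0.1]
print(propose_update(scalars, masks, biases))
```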

In some examples, the process may also enumerate the weight indices (e.g., 1, 2, 3, etc.), or shuffle one or more weights randomly. Additionally, and/or alternatively, the process may sequentially flip the weight corresponding to each index for each iteration, and start over once a weight has been accepted (as further explained below). Updating the AI model at 412 may result in one or more parameters (including weights) of the AI model being changed. These weights and/or parameters may be viewed as proposed weights, subject to acceptance or rejection, as further explained below.

With further reference to FIG. 4A, the process 400 may further include inferring the performance of the updated AI model (one or more proposed weights and/or parameters) by running the AI chip in the client device based on the updated (proposed) AI model at 414 and determining the performance value of the updated AI model at 416. In some examples, running the AI chip in the client device may include causing the AI chip to execute an AI task in the AI chip, where an embedded CeNN of the AI chip contains the updated AI model, such as a CNN. In other words, if the AI chip is a hardware-based chip, the weights and/or parameters of the updated AI model are loaded into the CeNN of the AI chip to be used in performing the AI task. An AI task may depend on the dataset. For example, a dataset may include the training data obtained at 402. For a recognition task using the training data, a performance value may be measured for the AI model being used. For example, an accuracy value may be determined at 416 based on the result of a given recognition task using the updated AI model.

Upon determining the performance value of the updated AI model, process 400 may further determine whether to accept the updated AI model based on the inferred performance of the updated model as described in 414, 416. If it is determined that the updated AI model is rejected, the process 400 may repeat updating the AI model at 412, until the updated AI model is accepted. In some examples, each of the rejected updated AI models may be abandoned. In other words, if an updated AI model is rejected, the process 400 may repeat updating the AI model at 412 based on the AI model before the rejected AI model rather than on the rejected AI model. If it is determined that the updated AI model is accepted, the process 400 may output the updated AI model at 420. For example, the process 400 may communicate the output AI model to the host device, which receives it (e.g., at 308 in FIG. 3).

In determining whether to accept or reject an updated AI model, the process 418 may determine to accept the updated AI model based on a probability that the updated AI model will be accepted. This probability may be determined based on the performance values of the current AI model and the updated AI model. In some examples, the probability for accepting the updated AI model may also be based on the weights and/or parameters of other client devices. In a non-limiting example, if the weights of an AI model have binary values, the probability may be determined as:

$p = e^{-\beta\left(E(w^{\prime r}) - E(w^{r})\right)}\,\frac{\cosh\left(\gamma\left(\bar{w}_{i} + 2w_{i}^{\prime r}\right)\right)}{\cosh\left(\gamma\,\bar{w}_{i}\right)}$

where β and γ are the training configuration parameters, and w^(r) are the weights of the current AI model, where r stands for the rth client device. For example, if there are N client devices participating in the training in parallel, then r is in the range of {1, 2, . . . , N}. w_(i)^(r) stands for the ith weight of the current AI model in the rth client device (e.g., AI chip), where i is in the range of {1, 2, . . . , W}, where W is the number of weights and/or parameters in the AI model, such as a CNN model. Similarly, w′^(r) are the weights and/or parameters of the updated AI model for the rth client device. The sum of the weights and/or parameters among the multiple client devices is defined as

$\bar{w}_{i} = \sum_{r = 1}^{N} w_{i}^{r}.$

E(w^(r)) may stand for the performance value of the current AI model. For example, E( ) may stand for the number of incorrectly classified samples given the training data obtained (e.g., in 402). In some examples, E( ) may stand for one minus the recognition accuracy of the AI model. As shown in the equation above, the probability may differ for each weight i in the AI model. In some examples, if multiple weights have been updated (e.g., at 412), the probability of accepting the updated AI model may include a product of the cosh terms for the multiple weights.
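For the binary-weight case, the acceptance test may be sketched as follows in Python, directly transcribing the expression above for a single flipped weight and then applying the random-number comparison described later (e.g., at 418). The example inputs and the capping of the probability at 1.0 are assumptions for illustration.

```python
import math
import random

def acceptance_probability_binary(E_old, E_new, w_bar_i, w_new_i, beta, gamma):
    """Acceptance probability for a single flipped binary weight (expression above).

    E_old, E_new : error of the current and proposed models, e.g., miscounted samples
    w_bar_i      : sum of the ith weight over all N client devices (current models)
    w_new_i      : proposed value of the ith weight on this device (+1 or -1)
    """
    energy_term = math.exp(-beta * (E_new - E_old))
    coupling_term = math.cosh(gamma * (w_bar_i + 2 * w_new_i)) / math.cosh(gamma * w_bar_i)
    return energy_term * coupling_term

p = acceptance_probability_binary(E_old=12, E_new=10, w_bar_i=3, w_new_i=-1,
                                  beta=1.0, gamma=0.1)
accepted = random.random() <= min(p, 1.0)         # random-number test (box 418, FIG. 4A)
print(p, accepted)
```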

In some examples, some of the weights and/or parameters of an AI model may have non-binary values, i.e., more than two values. In such a case, as an example, the probability for accepting the updated AI model may be defined as:

$p = e^{-\beta\left(E(w^{\prime r}) - E(w^{r})\right) - \gamma\left((w_{i}^{\prime r})^{2} - (w_{i}^{r})^{2}\right)}\,\frac{\sum_{k} e^{-\gamma\left(Rk^{2} - 2k\left(\bar{w}_{i} + w_{i}^{\prime r} - w_{i}^{r}\right)\right)}}{\sum_{k} e^{-\gamma\left(Rk^{2} - 2k\,\bar{w}_{i}\right)}}$

where k is summed over all allowed values of w_(i)^(r). In a non-limiting example, if the weights include 2-bit signed integers, then k may be summed over {±1, 0}. In another non-limiting example, if the weights include 12-bit unsigned integers, then k may be summed over {0, 1, 2, . . . , 4095}. In some examples, if multiple weights have been updated (e.g., at 412), the probability of accepting the updated AI model may include a product of the summed terms over k for the multiple weights. Similarly, the terms next to γ in the exponent will be summed over i for all changed weights.

In some examples, the client devices may not all be equally fast or may not update the AI models equally frequently (e.g., some may reject more than others). In some examples, a client device may choose to wait until the weights and/or parameters in all client devices are updated, and calculate the value

$\bar{w}_{i} = \sum_{r = 1}^{N} w_{i}^{r}$

(synchronous update). In another non-limiting example, a client device may choose not to wait, and may asynchronously use the available weights from other devices to calculate

$\bar{w}_{i} = \sum_{r = 1}^{N} w_{i}^{r}$

(asynchronous update). In some examples, some client devices may choose synchronous updates, while other client devices may choose asynchronous updates. In some examples, a client device may be configured to perform synchronous updates or asynchronous updates alternately for different iterations in a training process, e.g., the process 300 in FIG. 3.

In an example implementation, the process 418 may generate a random number, e.g., in the range of 0 to 1.0, and compare the random number to the probability for accepting the updated AI model. If the random number does not exceed the probability, the process may determine that the updated AI model is accepted. Otherwise, the process may continue without accepting the updated AI model.

With further reference to FIG. 4A, if it is determined that the updated AI model is accepted, process 400 may proceed and output the AI model at 420. In some examples, the process may return the updated AI model from the client device in which the training process 400 is implemented to the host device. If it is determined that the updated AI model is not accepted, the process may repeat the iteration at 412 and continue generating updated AI models until one is accepted. This iteration, which continues until an updated AI model is accepted, may be referred to as a greedy approach in that the client device keeps trying until an updated AI model is accepted.

FIG. 4B illustrates a diagram of an example process of using an AI chip to perform an AI task in accordance with various examples described herein. Once an optimal AI model is determined, such as in the training process 300 (in FIG. 3), the optimal AI model may be uploaded into an AI chip (e.g., 324 in FIG. 3) for performing future AI tasks. Any of the client devices, or a client device having an AI chip, may be configured to implement a process, such as process 450. The process 450 may include receiving an AI model at 451, where the received AI model may be uploaded to the AI chip in the client device after a training process is complete. The process 450 may also include receiving data from one or more sensors at 452. For example, the received data may be captured audio or images from a mobile phone camera, or an audio or video capturing device. The process 450 may run the AI chip to perform an AI task, such as a recognition task, at 454 to generate a recognition result, and output the recognition result at 456.
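A minimal Python sketch of process 450 is given below, with the AI chip replaced by a stub class; the model layout, sensor data shape, and class count are hypothetical assumptions for illustration.

```python
import numpy as np

class AIChipStub:
    """Stand-in for an AI chip with a loadable model (process 450 sketch)."""

    def __init__(self):
        self.model = None

    def load_model(self, model):                  # box 451: receive/upload the AI model
        self.model = model

    def run_task(self, sensor_data):              # box 454: perform the recognition task
        scores = sensor_data.mean(axis=(0, 1)) @ self.model
        return int(np.argmax(scores))             # recognition result, e.g., a class index

chip = AIChipStub()
chip.load_model(np.random.randn(3, 5))            # hypothetical 3-feature, 5-class model
image = np.random.rand(32, 32, 3)                 # box 452: data from a camera sensor
print(chip.run_task(image))                        # box 456: output the recognition result
```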

In a non-limiting example, a CNN model may be obtained via a training process in a parallel configuration, such as disclosed in FIG. 3 and FIG. 4A, and it may be loaded into the AI chip for execution. For example, respective weights and/or parameters of a CNN model that are trained for face recognition tasks may be loaded into an embedded CeNN in the AI chip. A host or client device may cause the AI chip to perform various AI tasks using the trained weights and/or parameters. For example, a client device may feed an input image into an AI chip and receive an image recognition result from the AI chip. The recognition result may indicate which class the input image belongs to. In a non-limiting example, the CNN model may be capable of recognizing one or more classes from an input image, such as a crying face and a smiling face. In an example application, an AI chip may be installed in a camera and store the weights and/or parameters of the CNN model. The AI chip may be configured to receive a captured image from the camera, perform an image recognition task based on the captured image and the stored CNN model, and output the recognition result. In outputting the recognition result, the camera may display, via a user interface, the recognition result. For example, the CNN model may be trained for face recognition. A captured image may include one or more facial images associated with one or more persons. The recognition result may include the names associated with each input facial image. The user interface may display a person's name next to or overlaid on each of the input facial images associated with the person.

FIG. 5 illustrates a diagram of an example process of obtaining an optimal AI model in a parallel configuration in accordance with various examples described herein. In some examples, a process 500 may be implemented to train an AI model. The process 500 may have a parallel configuration similar to that of the process 200 (in FIG. 2). As shown in FIG. 5, a process for each of the devices Device 1, Device 2, . . . , Device N is implemented in a similar fashion as in FIG. 2, except that each device may use a less greedy approach in generating the updated AI model. For example, in some examples, the process 500 may include providing training configuration parameters at 502, providing training data at 504, and/or providing initial AI models at 506. For example, process 504 may retrieve a training dataset and provide a subset of data shuffled from the larger training dataset. These processes 502, 504, 506 prepare the data and AI models that may be used by a process for each of the multiple devices Device 1, Device 2, . . . , or Device N. The process 500 may include multiple iterations, which may stop when an iteration stopping criterion is met at 522. During each iteration, a new training process may be started for each of the multiple devices. Each of the multiple devices may include an AI chip for running an AI task with an AI model inside the AI chip. Each of the multiple devices may upload an AI model to the AI chip in the device. A device may also be capable of updating the AI model. In a training process, the training data for each of the multiple devices may contain a subset of data shuffled from a larger training dataset.

In some examples, a process for one of the multiple devices may include obtaining training data at 508, running the AI chip in the device at 510, and determining a performance value of the run at 512, where the performance value may be indicative of the performance of the AI model used in the run. The process for each device may include multiple iterations, which stop when a maximum iteration count has been reached at 515. In each iteration, the process 500 may further include updating the AI model at 514. If the maximum iteration count has not been reached, the process may determine whether to accept the updated AI model at 516. If it is determined that the updated AI model is not accepted, the process may repeat the iteration by updating the AI model at 514. If it is determined that the updated model is accepted, the process may determine and cache an optimal AI model at 517 before repeating the iteration at 514.

If the maximum iteration count has been reached, the process may output the cached optimal AI model of each device. The process 500 may further update the training configuration parameters at 518, determine an optimal AI model from among the multiple devices at 520, and repeat the training process for the multiple devices until the iteration stopping criterion is met at 522. When the stopping criterion is met at 522, the process 500 may validate the optimal AI models at 524 and obtain an optimal AI model at 526. The details of the process 500 are further explained in FIG. 6.

In comparing FIG. 2 with FIG. 5, the processes in each of the devices (e.g., Device 1, Device 2, . . . , Device N) differ in that the process 200 in FIG. 2 is considered greedier because it continues updating the AI model until it is accepted, whereas the process 500 in FIG. 5 may stop the iterations of searching for an updated AI model after a maximum iteration count has been reached. In some examples, the processes as shown in FIGS. 2 and 5 may be implemented alternately in a single training process. For example, at a given iteration, a host device may choose to implement 214, 216 (the greedier approach in FIG. 2). In a subsequent iteration in the same training process, the host may implement 514-517 (in FIG. 5). As a non-limiting example, a process may choose to implement the greedier approach (e.g., 214, 216 in FIG. 2) once every 20 iterations, and implement processes 514-517 (in FIG. 5) at all other iterations. Alternatively, and/or additionally, a process may choose to implement the greedier approach (e.g., 214, 216 in FIG. 2) at the last few iterations, and implement processes 514-517 (in FIG. 5) at all other iterations. Other configurations may also be possible.

Now FIG. 6 illustrates a diagram of an example process of obtaining a local AI model in a training process that may be implemented in a client device in accordance with various examples described herein. A process 600 may be implemented as the process for each of the multiple devices in FIG. 5. In some examples, the process 600 may be implemented in a client device, a host device, and/or an AI chip, such as shown in FIG. 1. The process 600 may train an AI model via one or more iterations. At the beginning of the training process, the process 600 may include obtaining training data at 602, receiving training configuration parameters at 604, and/or receiving an (initial) AI model at 606. Boxes 602, 604 and 606 may be similar to boxes 402, 404 and 406, respectively, in FIG. 4A. The process 600 may run an AI chip based on the (initial) AI model at 608 to infer the performance of the AI model. For example, in running the AI chip, the process 608 may load the AI model into the AI chip and execute the AI chip to perform an AI task, such as a recognition task, using the training data from 602. The process 600 may further determine the performance value of the AI model at 610 by evaluating the result generated from the AI chip based on the AI model.

With further reference to FIG. 6, the process 600 may start the iteration at 612. For example, at each iteration, the process may include updating the AI model at 612 based on the current AI model. At the start of the iteration, the current AI model may be the initial AI model received at 606 (from a host device, for example). During subsequent iterations, the current AI model may be replaced by the last updated AI model obtained from 612. In some examples, the process may update the AI model at 612 by various methods in a similar fashion as described in 412 (in FIG. 4A).

With further reference to FIG. 6, the process 600 may further determine whether a maximum iteration count has been reached at 614. This may be less greedy than the process 400 (in FIG. 4A) in that the process stops when a stopping criterion is met, without necessarily waiting for the updated AI model to be accepted. If the maximum iteration count has not been reached at 614, the process 600 may also infer the performance of the updated AI model (one or more proposed weights and/or parameters) by running the AI chip in the client device to perform an AI task based on the updated (proposed) AI model at 616. In some examples, running the AI chip in the client device may include causing the AI chip to execute an AI task, such as a recognition task (e.g., face recognition, voice recognition, object recognition, etc.), in the AI chip, where an embedded CeNN of the AI chip contains the updated AI model, such as a CNN. In other words, if the AI chip is a hardware-based chip, the weights and/or parameters of the updated AI model are loaded into the CeNN of the AI chip to be used in performing the recognition task. A recognition task may depend on the dataset. For example, a dataset may include the training data obtained at 602. In some examples, the process 600 may also determine the performance value of the updated AI model at 617. For example, for an AI recognition task using the training data, a performance value may be measured for the updated AI model being used. For example, an accuracy value may be determined at 617 based on the result of a given recognition task using the updated AI model.

Upon determining the performance value of the updated AI model, the process 600 may further determine whether to accept the updated AI model at 618 based on the inferred performance of the updated model from 617. If it is determined that the updated AI model is accepted, the process 600 may determine an optimal AI model at 620 and repeat updating the AI model at 612, until the maximum iteration count is reached at 614. In determining the optimal AI model, the process may cache a local optimal AI model based on the performance values from each previous iteration, and progressively compare the performance value of the updated AI model with that of the cached local optimal AI model as the AI model is updated. If the performance value of the updated AI model is higher than that of the local optimal AI model, the local optimal AI model is replaced by the updated AI model and cached; otherwise, the local optimal AI model remains unchanged. If it is determined that the updated AI model is rejected, the process 600 may repeat updating the AI model at 612.
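
The caching and comparison at 620 may be summarized by the following sketch, which assumes that a higher performance value is better; the function and variable names are illustrative and do not appear in the figures.

    def update_local_optimum(cached_model, cached_value, updated_model, updated_value):
        """Keep the better of the cached local optimal AI model and the newly
        accepted updated AI model (box 620)."""
        if updated_value > cached_value:
            return updated_model, updated_value   # replace and cache the updated model
        return cached_model, cached_value         # keep the existing local optimum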

Returning to box 614, if the maximum iteration count has been reached, then the process 600 may output the cached optimal AI model at 622. For example, the process 600 may communicate the output AI model to the host device to cause the host device to start box 518 (in FIG. 5).

In determining whether to accept or reject an updated AI model, the process 618 may determine the probability of acceptance in a similar manner as described in process 418 in FIG. 4A. In an example implementation, the process 618 may generate a random number, e.g., in the range of 0 to 1.0, and compare the random number to the probability of accepting the updated AI model. If the random number does not exceed the probability, the process may determine that the updated AI model is accepted. Otherwise, the process may continue without accepting the updated AI model. The probability of accepting the updated AI model may be determined similarly to the process described in FIG. 4A.
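
A minimal sketch of the accept/reject test at 618 is shown below, assuming the acceptance probability has already been computed as described for FIG. 4A; the use of Python's random module here is an assumption made for illustration.

    import random

    def accept_update(acceptance_probability, rng=random):
        """Accept the updated AI model when a uniform random draw in [0, 1) does
        not exceed the acceptance probability (box 618)."""
        return rng.random() <= acceptance_probability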

FIGS. 7A-7D illustrate various methods of obtaining training data in a parallel configuration in accordance with various examples described herein. These methods of obtaining training data may be implemented in a host device or a client device, and may be applicable to the various processes described in FIGS. 1-6, such as 204 in FIG. 2, 302 in FIG. 3, 402 in FIG. 4A, 504 in FIG. 5, or 602 in FIG. 6. In some examples, as shown in FIG. 7A, a training dataset may include multiple subsets of training data. Each of the client devices participating in the training may obtain a respective subset of the multiple subsets. In FIGS. 7B and 7C, each of the client devices may obtain one or more subsets of the multiple subsets of training data. In FIG. 7D, multiple client devices participating in the training process may obtain multiple training datasets. For example, device 1 and device 2 may each use one or more subsets of training data in training dataset I, whereas devices 3-6 may each use a respective subset of training data in training dataset II. In some examples, as different training datasets may not be easily accessible to all devices, multiple host devices may be used to each handle a training dataset, where each host device may include one or more client devices performing the training in parallel (e.g., process 200 in FIG. 2, or process 500 in FIG. 5). Yet, as shown in the processes in FIGS. 2-6, the multiple client devices under each host device may communicate with one another to share their updated AI models at a given iteration during the training.
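
As an illustrative sketch of the partitioning shown in FIG. 7A, a training dataset may be split into non-overlapping subsets, one per participating client device; the helper below is an assumption that treats the dataset as a simple indexable sequence.

    def split_into_subsets(dataset, num_devices):
        """Partition a training dataset into num_devices non-overlapping subsets,
        one for each client device participating in the training (cf. FIG. 7A)."""
        return [dataset[i::num_devices] for i in range(num_devices)]

For example, split_into_subsets(list(range(8)), 4) yields four two-element subsets, each usable by a different client device.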

In an example implementation, user A is in California and user B is in New York. User A has a first training dataset containing pictures 1-4, and User B has a second training dataset containing pictures 5-8, where neither of them has access to the training dataset of the other because it may not be practical to send each other all the training datasets due to limited network bandwidth, storage limitations, and/or privacy issues. In such a case, both User A and User B may each proceed with their own training process in parallel, such as process 200 (in FIG. 2), by using separate training datasets, and each user may have one or more client devices (e.g., AI chips), such as Device 1, Device 2, . . . , Device N. In each iteration of the training process (e.g., boxes 208-222), the training data for each of the multiple devices are drawn from a larger training dataset without overlapping (e.g., 402 in FIG. 4A). If the larger training dataset has become empty, the previously drawn data may be shuffled and reused.

In a non-limiting example, during the first iteration, User A's device 1 may use pictures 1 and 3; device 2 may use pictures 1 and 4; and so on. User B's device 1 may use pictures 7 and 8; device 2 may use pictures 5 and 6; and so on. During the second iteration, User A's device 1 uses pictures 2 and 4; device 2 uses pictures 2 and 3; and so on. User B's device 1 uses pictures 5 and 6; device 2 uses pictures 7 and 8; and so on. During the third iteration, all datasets are exhausted, and hence old data may be shuffled and reused. Each device may draw data in the same manner as in the first iteration. In some examples, during subsequent iterations, the training data may be further shuffled. For example, User A may use pictures 3 and 4 in the training, and User B may use pictures 6 and 7 in the training. In another iteration, User A may use pictures 1 and 2 in the training, and User B may use pictures 5 and 8 in the training. The shuffling of training data may vary.
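
The drawing scheme in this example may be sketched as follows, where each device draws a fixed number of samples per iteration from its own pool without repetition, and shuffles and reuses the data once the pool has been exhausted. The class and method names below are illustrative only and do not correspond to a specific box in the figures.

    import random

    class TrainingDataDrawer:
        """Draw non-repeating batches from a training dataset for one device;
        shuffle and reuse the data once the dataset has been exhausted."""

        def __init__(self, dataset, rng=None):
            self._rng = rng or random.Random()
            self._dataset = list(dataset)
            self._pool = self._reshuffled()

        def _reshuffled(self):
            pool = list(self._dataset)
            self._rng.shuffle(pool)
            return pool

        def draw(self, batch_size):
            if len(self._pool) < batch_size:      # dataset exhausted: shuffle and reuse
                self._pool = self._reshuffled()
            batch, self._pool = self._pool[:batch_size], self._pool[batch_size:]
            return batch

    # Example (cf. User A above): one device draws 2 pictures per iteration from the
    # 4-picture dataset, never repeating a picture until the data are reshuffled.
    device_1 = TrainingDataDrawer(["picture 1", "picture 2", "picture 3", "picture 4"])
    first_iteration = device_1.draw(2)
    second_iteration = device_1.draw(2)   # the remaining two pictures
    third_iteration = device_1.draw(2)    # dataset exhausted: reshuffled and reused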

In some examples, the subsets of training data may be randomly shuffled during each iteration. In some examples, the amount of training data drawn from each dataset may vary. For example, Users A and B may draw half of the training dataset each time. In another example, different users may choose to draw some other fraction of the dataset, including a partial dataset or the entire dataset. As shown in the above example, multiple devices may participate in parallel in a training process, e.g., 200 (FIG. 2), 500 (FIG. 5). However, multiple devices may use non-overlapping or overlapping training data. In some examples, multiple devices may also use entirely separate training datasets.

It is appreciated that the disclosures of various embodiments in FIGS. 1-7 may vary. For example, the number of iterations in process 200 in FIG. 2, process 500 in FIG. 5, process 300 in FIG. 3, and process 600 in FIG. 6 may all vary and may be independent. In a non-limiting example, the number of iterations in process 600 for a client device may be in the range of 10-100, and the number of iterations in processes 200, 300 or 500 for a host device may be 1000. Other values may also be possible.

FIG. 8 depicts an example of internal hardware that may be included in any electronic device or computing system for implementing various methods in the embodiments described in FIGS. 1-7. An electrical bus 800 serves as an information highway interconnecting the other illustrated components of the hardware. Processor 805 is a central processing device of the system, configured to perform calculations and logic operations required to execute programming instructions. As used in this document and in the claims, the terms “processor” and “processing device” may refer to a single processor or any number of processors in a set of processors that collectively perform a process, whether a central processing unit (CPU) or a graphics processing unit (GPU) or a combination of the two. Read only memory (ROM), random access memory (RAM), flash memory, hard drives, and other devices capable of storing electronic data constitute examples of memory devices 825. A memory device, also referred to as a computer-readable medium, may include a single device or a collection of devices across which data and/or instructions are stored.

An optional display interface 830 may permit information from the bus 800 to be displayed on a display device 835 in visual, graphic, or alphanumeric format. An audio interface and audio output (such as a speaker) also may be provided. Communication with external devices may occur using various communication ports 840, such as a transmitter and/or receiver, an antenna, an RFID tag, and/or short-range or near-field communication circuitry. A communication port 840 may be attached to a communications network, such as the Internet, a local area network, or a cellular telephone data network.

The hardware may also include a user interface sensor 845 that allows for receipt of data from input devices 850 such as a keyboard, a mouse, a joystick, a touchscreen, a remote control, a pointing device, a video input device, and/or an audio input device, such as a microphone. Digital image frames may also be received from an image capturing device 855, such as a video or still camera, that can be either built-in or external to the system. Other environmental sensors 860, such as a GPS system and/or a temperature sensor, may be installed on the system and communicatively accessible by the processor 805, either directly or via the communication ports 840. The communication ports 840 may also communicate with the AI chip to upload or retrieve data to/from the chip. For example, the optimal AI model obtained from process 200 may be shared by all of the processing devices on the network. Any device on the network may receive the optimal AI model from the network and upload the optimal AI model, e.g., CNN weights, to the AI chip for performing an AI task via the communication port 840 and an SDK (software development kit). The communication port 840 may also communicate with any other interface circuit or device that is designed for communicating with an integrated circuit.

Optionally, the hardware may not need to include a memory, but instead programming instructions may be run on one or more virtual machines or one or more containers on a cloud. For example, the various methods illustrated above may be implemented by a server on a cloud that includes multiple virtual machines, each virtual machine having an operating system, a virtual disk, a virtual network and applications, and the programming instructions for implementing the various functions described herein may be stored on one or more of those virtual machines on the cloud.

Various embodiments described above may be implemented and adapted to various applications. For example, the AI chip having a CeNN architecture may reside in an electronic mobile device. The electronic mobile device may use the built-in AI chip to produce recognition results and generate performance values. In some scenarios, obtaining the CNN can be done in the mobile device itself, where the mobile device retrieves training data from a dataset and uses the built-in AI chip to perform the training. In other scenarios, the processing device may be a server device in the communication network (e.g., 102 in FIG. 1) or may be on the cloud. These are only examples of applications in which an AI task can be performed in the AI chip.

The various systems and methods disclosed in this patent document provide advantages over the prior art, whether implemented standalone or combined. For example, using the systems and methods described in FIGS. 1-8 may help obtain the optimal AI model using multiple networked devices and multiple AI chips in a parallel configuration. The various devices in the parallel configuration may communicate with each other in either a centralized, decentralized, or distributed network. This parallel configuration and networked approach helps the system to narrow the search space of the AI model during the training process; thus, the system may converge to the optimal AI model faster. Furthermore, the particular formulae for accept/reject in the examples may reduce overfitting. The above disclosed embodiments also allow different subsets of training data to be shuffled to obtain a local optimal AI model for each AI chip. In some examples, one or more functions in the process 200 (FIG. 2) or the process 500 (FIG. 5) may be implemented in a host device and multiple client devices. Alternatively, and/or additionally, the one or more functions in these processes may also be implemented in a single or multiple host devices, and/or a single or multiple client devices. For example, as shown in FIG. 1, a host device 116 may include multiple AI chips. In such a case, all of the functions in FIG. 2, for example, may be implemented in the host device 116, whereas running AI chips (e.g., 210) may be directly performed on one or more physical AI chips under the host device 116. The above-illustrated embodiments are described in the context of generating a CNN model for an AI chip (physical or virtual), but can also be applied to various other applications. For example, the current solution is not limited to implementing the CNN but can also be applied to other algorithms or architectures inside an AI chip.

It will be readily understood that the components of the present solution as generally described herein and illustrated in the appended figures could be arranged and designed in a wide variety of different configurations. Thus, the detailed description of various implementations, as represented herein and in the figures, is not intended to limit the scope of the present disclosure, but is merely representative of various implementations. While the various aspects of the present solution are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The present solution may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the present solution is, therefore, indicated by the appended claims rather than by this detailed description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Reference throughout this specification to features, advantages, or similar language does not imply that all of the features and advantages that may be realized with the present solution should be or are in any single embodiment thereof. Rather, language referring to the features and advantages is understood to mean that a specific feature, advantage, or characteristic described in connection with an embodiment is included in at least one embodiment of the present solution. Thus, discussions of the features and advantages, and similar language, throughout the specification may, but do not necessarily, refer to the same embodiment.

Furthermore, the described features, advantages, and characteristics of the present solution may be combined in any suitable manner in one or more embodiments. One ordinarily skilled in the relevant art will recognize, in light of the description herein, that the present solution can be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the present solution.

Other advantages can be apparent to those skilled in the art from the foregoing specification. Accordingly, it will be recognized by those skilled in the art that changes, modifications, or combinations may be made to the above-described embodiments without departing from the broad inventive concepts of the invention. It should therefore be understood that the present solution is not limited to the particular embodiments described herein, but is intended to include all changes, modifications, and all combinations of various embodiments that are within the scope and spirit of the invention as defined in the claims.

We claim:
1. A system comprising: a plurality of artificial intelligence (AI) chips; and a processing device communicatively coupled to the plurality of AI chips and configured to: (i) receive a respective AI model and an associated performance value of the respective AI model from each of the plurality of AI chips, wherein the respective AI model from an AI chip of the plurality of AI chips is obtained based at least on the respective AI models from other AI chips of the plurality of AI chips; (ii) determine an optimal AI model that has a best performance value among the performance values associated with the respective AI models from the plurality of AI chips; (iii) repeat (i)-(ii) for a first number of iterations; and (iv) upon completion of the first number of iterations, output the optimal AI model.
2. The system of claim 1, wherein the processing device is configured to output the optimal AI model by loading one or more parameters of the optimal AI model into an embedded cellular neural network architecture of an AI integrated circuit, wherein the AI integrated circuit is coupled to a sensor and configured to: receive data captured from the sensor; and perform an AI task based on the captured data and the optimal AI model in the embedded cellular neural network architecture.
3. The system of claim 1, wherein each of the plurality of AI chips is configured to obtain respective training data, wherein the respective training data in the plurality of AI chips are non-overlapping subsets of a training dataset.
4. The system of claim 3, wherein each of the plurality of AI chips is further configured to generate the respective AI model by: (v) generating an updated AI model based at least on a current AI model in a preceding iteration; (vi) determining whether to accept the updated AI model; and (vii) upon determining to accept the updated AI model, outputting the updated AI model; otherwise, updating the current AI model with the updated AI model and repeating steps (v)-(vii).
5. The system of claim 4, wherein the processing device is further configured to, after outputting the optimal AI model, further validate the optimal AI model.
6. The system of claim 4, wherein an AI chip of the plurality of AI chips is configured to determine whether to accept the updated AI model based at least on the respective AI models of other AI chips of the plurality of AI chips.
7. The system of claim 6, wherein the AI chip is further configured to determine whether to accept the updated AI model based at least on a performance value of the updated AI model, wherein the performance value of the updated AI model is generated by an embedded cellular neural network (CeNN) in the AI chip using the updated AI model and the respective training data for the AI chip.
8. The system of claim 6, wherein the AI chip is further configured to determine whether to accept the updated AI model based at least on a performance value of the current AI model in the preceding iteration, wherein the performance value of the current AI model is generated by the embedded CeNN in the AI chip using the current AI model and the respective training data for the AI chip.
9. The system of claim 4, wherein the processing device is further configured to provide or update training configuration parameters to the plurality of AI chips before receiving the respective AI model from each of the plurality of AI chips, and wherein each of the plurality of AI chips is configured to generate the updated AI model based at least on the training configuration parameters.
10. The system of claim 3, wherein each of the plurality of AI chips is further configured to generate the respective AI model by: (v) generating an updated AI model based on an updated AI model from a previous iteration; (vi) determining whether to accept the updated AI model; (vii) upon determining to accept the updated AI model, determining a local optimal AI model; (viii) repeating (v)-(vii) for a second number of iterations; and (ix) upon completion of the second number of iterations, outputting the local optimal AI model.
11. The system of claim 10, wherein the processing device is further configured to, after outputting the optimal AI model, validate the optimal AI model.
12. A method comprising, by a processing device: (i) receiving a respective artificial intelligence (AI) model and an associated performance value of the respective AI model from each of a plurality of AI chips, wherein the respective AI model from an AI chip of the plurality of AI chips is obtained based at least on the respective AI models from other AI chips of the plurality of AI chips; (ii) determining an optimal AI model that has a best performance value among the performance values associated with the respective AI models from the plurality of AI chips; (iii) repeating (i)-(ii) for a first number of iterations; and (iv) upon completion of the first number of iterations, outputting the optimal AI model.
13. The method of claim 12, wherein outputting the optimal AI model comprises loading one or more parameters of the optimal AI model into an embedded cellular neural network architecture of an AI integrated circuit, wherein the AI integrated circuit is coupled to a sensor and configured to: receive data captured from the sensor; and perform an AI task based on the captured data and the optimal AI model in the embedded cellular neural network architecture.
14. The method of claim 12, further comprising obtaining respective training data for each of the plurality of AI chips, wherein the respective training data in the plurality of AI chips are non-overlapping subsets of a training dataset.
15. The method of claim 14, further comprising, at each of the plurality of AI chips, generating the respective AI model by: (v) generating an updated AI model based at least on a current AI model in a preceding iteration; (vi) determining whether to accept the updated AI model; and (vii) upon determining to accept the updated AI model, outputting the updated AI model; otherwise, updating the current AI model with the updated AI model and repeating steps (v)-(vii).
16. The method of claim 15, further comprising validating the optimal AI model after outputting the optimal AI model.
17. The method of claim 14, wherein determining whether to accept the updated AI model is based at least on the respective AI models of other AI chips of the plurality of AI chips.
18. The method of claim 17, wherein determining whether to accept the updated AI model is based at least on a performance value of the updated AI model, wherein the performance value of the updated AI model is generated by an embedded cellular neural network (CeNN) in the AI chip using the updated AI model and the respective training data for the AI chip.
19. The method of claim 17, wherein determining whether to accept the updated AI model is based at least on a performance value of the current AI model in the preceding iteration, wherein the performance value of the current AI model is generated by the embedded CeNN in the AI chip using the current AI model and the respective training data for the AI chip.
 20. The method of claim 15, further comprising providing or updating training configuration parameters to the plurality of AI chips before receiving the respective AI model from each of the plurality of AI chips, wherein generating the updated AI model at each of the plurality of AI chips is based at least on the training configuration parameters.
21. The method of claim 14, further comprising, at each of the plurality of AI chips, generating the respective AI model by: (v) generating an updated AI model based on an updated AI model from a previous iteration; (vi) determining whether to accept the updated AI model; (vii) upon determining to accept the updated AI model, determining a local optimal AI model; (viii) repeating (v)-(vii) for a second number of iterations; and (ix) upon completion of the second number of iterations, outputting the local optimal AI model.